DM-39582: Add caching for some butler primitives during deserialization #858

natelust · 2023-06-27T21:25:27Z

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes

Often many butler primitives are deserialized at the same time, and it is useful for these objects to share references to each other. This reduces load time and memory usage.

Downstream code now depends on refs holding UUIDs. Have the yaml loader convert old style integer ids to UUIDs early rather than waiting for downstream cleanups.

codecov · 2023-06-27T21:43:01Z

Codecov Report

Patch coverage: 63.79% and project coverage change: -0.13 ⚠️

Comparison is base (6618b21) 88.01% compared to head (ef41fb5) 87.89%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #858      +/-   ##
==========================================
- Coverage   88.01%   87.89%   -0.13%     
==========================================
  Files         269      270       +1     
  Lines       35420    35544     +124     
  Branches     7424     7452      +28     
==========================================
+ Hits        31176    31241      +65     
- Misses       3103     3145      +42     
- Partials     1141     1158      +17

Impacted Files	Coverage Δ
...hon/lsst/daf/butler/core/dimensions/_coordinate.py	`87.46% <38.46%> (-1.85%)`	⬇️
python/lsst/daf/butler/core/datastoreRecordData.py	`89.41% <45.45%> (-6.59%)`	⬇️
python/lsst/daf/butler/core/datasets/type.py	`84.58% <46.66%> (-2.59%)`	⬇️
python/lsst/daf/butler/core/dimensions/_records.py	`85.80% <52.94%> (-4.00%)`	⬇️
python/lsst/daf/butler/core/persistenceContext.py	`59.25% <59.25%> (ø)`
python/lsst/daf/butler/core/datasets/ref.py	`83.73% <72.72%> (-1.91%)`	⬇️
python/lsst/daf/butler/core/quantum.py	`87.61% <78.94%> (+0.29%)`	⬆️
python/lsst/daf/butler/transfers/_yaml.py	`87.50% <90.90%> (+0.47%)`	⬆️
python/lsst/daf/butler/_butler.py	`78.69% <100.00%> (ø)`
python/lsst/daf/butler/core/__init__.py	`100.00% <100.00%> (ø)`
... and 2 more

... and 1 file with indirect coverage changes


andy-slac

Looks OK, one concern is about cache key for DatasetRef maybe needing component. I did not comment on docstring style, I think ruff should not take care of that.
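
The cache-key concern can be illustrated with a minimal sketch. It assumes a plain dictionary cache and a hypothetical `FakeRef` stand-in rather than the real `DatasetRef`, and it assumes, as the comment implies, that a component ref can carry the same ID as its parent. Under those assumptions, a key of `(id, component)` keeps the two entries apart where a key of `id` alone would not.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for a dataset reference (not the butler class)."""

    id: uuid.UUID
    component: Optional[str] = None


def cache_key(ref: FakeRef) -> tuple[uuid.UUID, Optional[str]]:
    # Include the component so a parent and its component never collide.
    return (ref.id, ref.component)


shared_id = uuid.uuid4()
parent = FakeRef(shared_id)
component = FakeRef(shared_id, component="wcs")

cache: dict[tuple[uuid.UUID, Optional[str]], FakeRef] = {}
cache[cache_key(parent)] = parent
cache[cache_key(component)] = component
```

With the ID as the only key, the second assignment would overwrite the first and a later lookup for the parent would return the component ref.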

andy-slac · 2023-06-28T00:59:23Z

doc/changes/DM-39582.api.md

@@ -0,0 +1 @@
+Deprecate reconstituteDimensions argument from Quantum.from_simple


I think file name should be removal, not api, according to README.

Add backticks around reconstituteDimensions and Quantum.from_simple and period at the end.

andy-slac · 2023-06-28T01:38:29Z

python/lsst/daf/butler/core/datasets/ref.py

        # Minimalist component will just specify component and id and
        # require registry to reconstruct
-        if set(simple.dict(exclude_unset=True, exclude_defaults=True)).issubset({"id", "component"}):
+        if not (simple.datasetType is not None or simple.dataId is not None or simple.run is not None):


Could you rewrite this as `simple.datasetType is None and simple.dataId is None and simple.run is None`? I think it makes it easier to read.

That is not logically the same thing. We only want to run this when they are all False. But `and` is greedy, so False will always gobble up anything.
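
The two conditions in this thread can be compared exhaustively. The sketch below uses `types.SimpleNamespace` as a hypothetical stand-in for the serialized ref, modeling only the three attributes that appear in the diff, and evaluates both forms for every combination of `None` and non-`None` values. By De Morgan's law, `not (a or b or c)` is equivalent to `(not a) and (not b) and (not c)`, so the two forms are expected to agree in all eight cases.

```python
from itertools import product
from types import SimpleNamespace


def merged_condition(simple) -> bool:
    # Form used in the diff.
    return not (simple.datasetType is not None or simple.dataId is not None or simple.run is not None)


def suggested_condition(simple) -> bool:
    # Form suggested in review.
    return simple.datasetType is None and simple.dataId is None and simple.run is None


# Try every combination of "unset" (None) and "set" (a placeholder string).
results = []
for dataset_type, data_id, run in product([None, "set"], repeat=3):
    simple = SimpleNamespace(datasetType=dataset_type, dataId=data_id, run=run)
    results.append(merged_condition(simple) == suggested_condition(simple))

all_agree = all(results)
```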

andy-slac · 2023-06-28T01:47:52Z

python/lsst/daf/butler/core/datastoreRecordData.py

+        key = frozenset(dataset_ids)
+        cache = PersistenceContextVars.serializedDatastoreRecordMapping.get()
+        if cache is not None and (value := cache.get(key)) is not None:
+            return value


I would not expect many (or maybe any) cache hits for this. DatastoreRecordData is per-quantum structure, I do not think any two quanta can have the same set of input datasets?

andy-slac · 2023-06-28T02:04:10Z

python/lsst/daf/butler/transfers/_yaml.py

@@ -64,6 +64,8 @@
 this version of the code.
 """

+_refIntId2UUID = defaultdict[int, uuid.UUID](uuid.uuid4)


Is this a workaround for input YAML files that still have integer dataset IDs in them? Should we instead fix those YAML files? It may also be better to generate reproducible UUIDs in that case if you want to keep this map.

I could not be sure of all the places where this might be happening, so I opted to fix it here. Good point on the reproducibility; I will go with that.
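
One way to get reproducible UUIDs from integer IDs is a name-based UUID, sketched below with the standard library's `uuid.uuid5`. This is an illustration only and is not taken from the merged change; the namespace string and the helper name `int_id_to_uuid` are hypothetical.

```python
import uuid

# Hypothetical fixed namespace; it must never change, or the mapping changes.
_LEGACY_ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example:legacy-integer-dataset-id")


def int_id_to_uuid(int_id: int) -> uuid.UUID:
    """Map an integer ID to the same UUID on every run."""
    return uuid.uuid5(_LEGACY_ID_NAMESPACE, str(int_id))


same_a = int_id_to_uuid(42)
same_b = int_id_to_uuid(42)
other = int_id_to_uuid(43)
```

Unlike a `defaultdict` backed by `uuid.uuid4`, which is stable only within one process, this mapping gives the same result across separate loads of the same file.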

natelust added 6 commits June 27, 2023 14:58

Add an object for handling deserialization caches

1310356

Optimize memory and load times on deserialization

97137a2

Often many butler primitives are deserialized at the same time, and it is useful for these objects to share references to each other. This reduces load time and memory usage.

Convert integer ids to UUID early

93e3286

Downstream code now depends on refs holding UUIDs. Have the yaml loader convert old style integer ids to UUIDs early rather than waiting for downstream cleanups.

Black and isort changes

498258d

Convert test to use UUID instead of int

a2805b3

Add release documentation

b849bea

natelust requested a review from andy-slac June 27, 2023 21:26

timj changed the title ~~tickets/DM-39582~~ DM-39582: Add caching for some butler primitives during deserialization Jun 27, 2023

andy-slac approved these changes Jun 28, 2023

natelust and others added 4 commits June 28, 2023 16:28

Address formatting/MYPY issues

253a01d

Add some defensive programming to appease mypy

ce5b439

Check that the datastore record has the right class on read

fbf00cb

Fix mypy with flag comparison change

5432818

MyPy seems to narrow types somehow when comparing Enum Flags directly with equality operators. Compare by value instead.
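
The affected code is not shown in this conversation, so the sketch below only illustrates what comparing by value can look like. The `Access` flag type and the `is_read_only` helper are hypothetical.

```python
import enum


class Access(enum.Flag):
    """Hypothetical flag type used only for this illustration."""

    READ = enum.auto()
    WRITE = enum.auto()


def is_read_only(mode: Access) -> bool:
    # Compare the underlying integers rather than the Flag members.
    return mode.value == Access.READ.value


read_only = is_read_only(Access.READ)
read_write = is_read_only(Access.READ | Access.WRITE)
```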

natelust force-pushed the tickets/DM-39582 branch 2 times, most recently from 1178cdb to 11e458f Compare June 30, 2023 22:26

Address review feedback

ef41fb5

natelust force-pushed the tickets/DM-39582 branch from 11e458f to ef41fb5 Compare June 30, 2023 22:29

natelust merged commit d926a5c into main Jul 4, 2023

natelust deleted the tickets/DM-39582 branch July 4, 2023 13:03
